Abstract

Accurately estimating a person's head position and orientation is an important task for a wide range of applications such as driver awareness and human-robot interaction. Over the past two decades, many approaches have been suggested to solve this problem, each with its own advantages and disadvantages. In this paper, we present a probabilistic framework called Monocular Adaptive View-based Appearance Model (MAVAM) which integrates the advantages of two of these approaches: (1) the relative precision and user-independence of differential registration, and (2) the robustness and bounded drift of keyframe tracking. In our experiments, we show how the MAVAM model can be used to estimate head position and orientation in real-time using a simple monocular camera. Our experiments on two previously published datasets show that the MAVAM framework can accurately track for a long period of time (>2 minutes), with an average accuracy of 3.9° and 1.2in measured against ground truth from an inertial sensor and a 3D magnetic sensor.

1  Introduction 

Real-time, robust head pose estimation algorithms have the potential to greatly advance the fields of human-computer and human-robot interaction. Possible applications include novel computer input devices (Fu and Huang, 2007), head gesture recognition, driver fatigue recognition systems (Baker et al., 2004), attention awareness for intelligent tutoring systems, and social interaction analysis. Pose estimation may also benefit secondary face analysis, such as facial expression recognition and eye gaze estimation, by allowing the 3D face to be warped to a canonical frontal view prior to further processing.

Two main paradigms exist for automatically estimating head pose. Dynamic approaches, also called differential or motion-based approaches, track the position and orientation of the head through video sequences using pair-wise registration (i.e., the transformation between two frames). Their strengths are user-independence and higher precision for relative pose over short time scales, but they are typically susceptible to accuracy drift over long time scales due to accumulated uncertainty. They also usually require the initial position and pose of the head to be set either manually or using a supplemental automatic pose detector. Keyframe-based approaches, also called template-based approaches, use information previously acquired about the user (automatically or manually) to estimate the head position and orientation. These approaches are more accurate and suffer only bounded drift over time, but they lack the relative precision of dynamic approaches.

In this paper we present a Monocular Adaptive View-based Appearance Model (MAVAM) which integrates the two estimation paradigms described above in one probabilistic framework. The proposed approach has the high precision of a motion-based tracker and does not drift over time. MAVAM was specifically designed to estimate the 6 degrees-of-freedom (DOF) of head pose in real-time from a single monocular camera with known internal calibration parameters (i.e., focal length and image center).

The following section describes previous work in head pose estimation and explains the difference between MAVAM and other integration frameworks. Section 3 formally describes our view-based appearance model (MAVAM) and how it is adapted automatically over time. Section 4 explains the details of the estimation algorithms used to apply MAVAM to head pose tracking. Section 5 describes our experimental methodology and shows our comparative results.

Report Documentation Page (Standard Form 298, Rev. 8-98)

Report date: DEC 2008
Report type: N/A
Title: Real-Time Head Pose Estimation Using A Webcam: Monocular Adaptive View-Based Appearance Model
Performing organization: USC Institute for Creative Technologies, Marina del Rey, CA 90292
Distribution/availability statement: Approved for public release, distribution unlimited
Supplementary notes: See also ADM002187. Proceedings of the Army Science Conference (26th) held in Orlando, Florida on 1-4 December 2008. The original document contains color images.
Security classification (report, abstract, this page): unclassified
Limitation of abstract: UU
Number of pages: 8


2  Previous  Work 

Over the past two decades, many techniques have been developed for estimating head pose. Very accurate shape models are possible using the Active Appearance Model (AAM) methodology (Cootes et al., 2001), such as was applied to 3D head data in (Blanz and Vetter, 1999). However, tracking 3D AAMs with monocular intensity images is currently a time-consuming process, and requires that the trained model be general enough to include the class of the user being tracked.

Early work in the dynamic paradigm assumed simple shape models (e.g., planar (Black and Yacoob, 1995), cylindrical (La Cascia et al., 2000), or ellipsoidal (Basu et al., 1996)). Tracking can also be performed with a 3D face texture mesh (Schodl et al., 1998) or a 3D face feature mesh (Wiskott et al., 1997). Some recent work looked at morphable models rather than rigid models (Brand, 2001; Bregler et al., 2000; Torresani and Hertzmann, 2004). Differential registration algorithms are known for user-independence and high precision for short time scale estimates of pose change, but they are typically susceptible to long time scale accuracy drift due to accumulated uncertainty over time.

Earlier work in the keyframe-based paradigm includes nearest-neighbor prototype methods (Wu and Trivedi, 2005; Fu and Huang, 2006) and template-based approaches (Kjeldsen, 2001). Vacchetti et al. suggested a method to merge online and offline keyframes for stable 3D tracking (Vacchetti et al., 2003). These approaches are more accurate and suffer only bounded drift over time, but they lack the relative precision of dynamic approaches.

Morency et al. (Morency et al., 2003) presented the Adaptive View-based Appearance Model (AVAM) for head tracking from stereo images. MAVAM generalizes the AVAM approach by operating on intensity images from a single monocular camera. This generalization faced two difficult challenges:

•  Segmenting the face and selecting the base frame set without any depth information, by using a multiple face hypotheses approach (described in Section 3.1).

•  Computing accurate pose-change estimates between two frames with only intensity images, using an iterative Normal Flow Constraint (described in Section 4.1).

MAVAM also includes some new functionality, such as keyframe management and a 4D pose tessellation space for keyframe acquisition (see Section 3.4 for details). The following two sections formally describe this generalization.

Figure 1: Monocular Adaptive View-based Appearance Model (MAVAM). The pose of the current frame $x_t$ is estimated using the pose-change measurements from two paradigms: differential tracking $y^t_{t-1}$, and keyframe tracking $y^t_{k_2}$. During the same pose update process (described in Section 3.3), the poses $\{x_{k_1}, x_{k_2}, \ldots\}$ from keyframes acquired online will be automatically adapted.

3  Monocular Adaptive View-based Appearance Model

The two main components of the Monocular Adaptive View-based Appearance Model (MAVAM) are the view-based appearance model $\mathcal{M}$, which is acquired and adapted over time, and the series of pose-change measurements $\mathcal{Y}$ estimated every time a new frame is grabbed. Figure 1 shows an overview of our MAVAM framework. Algorithm 1 presents a high-level overview of the main steps for head pose estimation using MAVAM.

A conventional view-based appearance model (Cootes et al., 2002) consists of different views of the same object of interest (e.g., images representing the head at different orientations). MAVAM extends the concept of a view-based appearance model by associating a pose and covariance with each view. Our view-based model $\mathcal{M}$ is formally defined as

$$\mathcal{M} = \{\{I_i, x_i\}, \Lambda_X\}$$

where each view $i$ is represented by $I_i$ and $x_i$, which are respectively the intensity image and its associated pose modeled with a Gaussian distribution, and $\Lambda_X$ is the covariance matrix over all random variables $x_i$. For each pose $x_i$, there exists a sub-matrix $\Lambda_{x_i}$ on the diagonal of $\Lambda_X$ that represents the covariance of the pose $x_i$. Each pose is a 6-dimensional vector consisting of the translation and the three Euler angles $[\,T_x \;\; T_y \;\; T_z \;\; \Omega_x \;\; \Omega_y \;\; \Omega_z\,]$. The pose estimates in our view-based model are adapted using the Kalman filter update, with the pose-change measurements $\mathcal{Y}$ as observations and the concatenated poses as the state vector. Section 3.3 describes this adaptation process in detail.
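
As an illustration of this definition, the following minimal Python sketch stores the views and the joint covariance $\Lambda_X$ as plain nested lists. The class and method names (`View`, `ViewBasedModel`, `add_view`, `pose_covariance`, `variance`) are hypothetical, and initializing the cross-covariance of a newly added view to zero is an assumption made here for simplicity, not something the model definition prescribes.

```python
from dataclasses import dataclass, field
from typing import List

POSE_DIM = 6  # [Tx, Ty, Tz, Omega_x, Omega_y, Omega_z]

Matrix = List[List[float]]


@dataclass
class View:
    """One view (I_i, x_i): an intensity image and the mean of its 6-DOF pose."""
    image: Matrix
    pose: List[float]


@dataclass
class ViewBasedModel:
    """M = {{I_i, x_i}, Lambda_X}, with Lambda_X stored as a dense 6n x 6n matrix."""
    views: List[View] = field(default_factory=list)
    covariance: Matrix = field(default_factory=list)

    def add_view(self, view: View, pose_cov: Matrix) -> int:
        """Append a view; its cross-covariances start at zero (assumption)."""
        old_size = len(self.covariance)
        # Widen the existing rows with zero cross-covariance columns.
        for row in self.covariance:
            row.extend([0.0] * POSE_DIM)
        # Append the new diagonal block Lambda_{x_i}.
        for r in range(POSE_DIM):
            self.covariance.append([0.0] * old_size + list(pose_cov[r]))
        self.views.append(view)
        return len(self.views) - 1

    def pose_covariance(self, i: int) -> Matrix:
        """Return the 6x6 diagonal block Lambda_{x_i} of Lambda_X."""
        a = i * POSE_DIM
        return [row[a:a + POSE_DIM] for row in self.covariance[a:a + POSE_DIM]]

    def variance(self, i: int) -> float:
        """Scalar variance of view i, taken as the trace of Lambda_{x_i}."""
        block = self.pose_covariance(i)
        return sum(block[d][d] for d in range(POSE_DIM))
```

Keeping one dense matrix, rather than one covariance per view, is what allows a single measurement to adjust several poses at once during the adaptation step.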

The views $(I_i, x_i)$ represent the object of interest (i.e., the head) as it appears from different angles and depths. Different pose estimation paradigms use different types of views:

•  A differential tracker uses only two views: the current frame $(I_t, x_t)$ and the previous frame $(I_{t-1}, x_{t-1})$.

•  In a keyframe-based (or template-based) approach there are $1 + n$ views: the current frame $(I_t, x_t)$ and the $j = 1 \ldots n$ keyframes $\{I_{k_j}, x_{k_j}\}$. Note that MAVAM acquires keyframes online and adapts the poses of these keyframes during tracking, so $n$, $\{x_{k_j}\}$ and $\Lambda_X$ change over time.

Since MAVAM integrates two estimation paradigms, its view-based model $\mathcal{M}$ consists of $2 + n$ views: the current frame $(I_t, x_t)$, the previous frame $(I_{t-1}, x_{t-1})$, and $n$ keyframe views $\{I_{k_j}, x_{k_j}\}$, where $j = 1 \ldots n$. The keyframes are selected online to best represent the head under different orientations and positions. Section 3.4 describes the details of this tessellation.

Algorithm 1 Tracking with a Monocular Adaptive View-based Appearance Model (MAVAM).
for each new frame ($I_t$) do
    Base frame set selection: Select the $n_b$ most similar keyframes to the current frame and add them to the base frame set. Always include the previous frame $(I_{t-1}, x_{t-1})$ in the base frame set (see Section 3.1);
    Pose-change measurements: For each base frame, compute the relative transformation $y^t_s$, and its covariance $\Lambda_{y^t_s}$, between the current frame and the base frame (see Sections 3.2 and 4 for details);
    Model adaptation and pose estimation: Simultaneously update the pose of all keyframes and compute the current pose $x_t$ by solving Equations 1 and 2 given the pose-change measurements $\{y^t_s, \Lambda_{y^t_s}\}$ (see Section 3.3);
    Online keyframe acquisition and management: Ensure a constant tessellation of the pose space in the view-based model by adding the new frame $(I_t, x_t)$ as a keyframe if it is different from any other view in $\mathcal{M}$, and by removing redundant keyframes after the model adaptation (see Section 3.4).
end for

3.1  Base  Frame  Set  Selection 

The goal of the base frame set selection process is to find a subset of views (base frames) in the current view-based appearance model $\mathcal{M}$ that are similar in appearance (and implicitly in pose) to the current frame $I_t$. This step reduces the computation time since pose-change measurements will be computed only on this subset.

To perform good base frame set selection (and pose-change measurements) we need to segment the face in the current frame. In the original AVAM algorithm (Morency et al., 2003), face segmentation was simplified by using the depth images from the stereo camera; with only an approximate estimate of the 2D position of the face and a simple 3D model of the head (i.e., a 3D box), AVAM was able to segment the face. Since MAVAM uses only a monocular camera model, its base frame set selection algorithm is necessarily more sophisticated. Algorithm 2 summarizes the base frame set selection process.


Algorithm 2 Base Frame Set Selection. Given the current frame $I_t$ and view-based model $\mathcal{M}$, returns a set of selected base frames $\{I_s, x_s\}$.
Create face hypotheses for current frame: Based on the previous frame pose $x_{t-1}$ and its associated covariance $\Lambda_{x_{t-1}}$, create a set of face hypotheses for the current frame (see Section 3.1 for details). Each face hypothesis is composed of a 2D coordinate and a scale factor representing the face center and its approximate depth.
for each keyframe $(I_{k_j}, x_{k_j})$ do
    Compute face segmentation in keyframe: Position the ellipsoid head model (see Section 4.1) at pose $x_{k_j}$, back-project it into the image plane $I_{k_j}$ and compute the valid face pixels.
    for each current frame face hypothesis do
        Align current frame: Based on the face hypothesis, scale and translate the current image to be aligned with the center of the keyframe face segmentation.
        Compute distance: Compute the L2-norm distance between the keyframe and the aligned current frame for all valid pixels from the keyframe face segmentation.
    end for
    Select face hypothesis: The face hypothesis with the smallest distance is selected to represent this keyframe.
end for
Base frame set selection: Based on their correlation scores, add the $n_b$ best keyframes to the base frame set. Note that the previous frame $(I_{t-1}, x_{t-1})$ is always added to the base frame set.


The ellipsoid head model used to create the face mask for each keyframe is a half ellipsoid with the dimensions of an average head (see Section 4.1 for more details). The ellipsoid is rotated and translated based on the keyframe pose $x_{k_j}$ and then projected into the image plane using the camera's internal calibration parameters (focal length and image center).
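
The projection step can be illustrated with a standard pinhole camera model. The sketch below is only a minimal example under stated assumptions: the function names are hypothetical, only a rotation about the vertical (Y) axis is shown rather than the full three-angle rotation, points are assumed to lie in front of the camera (Z > 0), and the image center is assumed to be the middle of a 320x240 image.

```python
import math
from typing import Tuple

Point3 = Tuple[float, float, float]


def rotate_about_y(p: Point3, angle_rad: float) -> Point3:
    """Rotate a 3D point about the Y axis (single-axis rotation only)."""
    x, y, z = p
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x + s * z, y, -s * x + c * z)


def translate(p: Point3, t: Point3) -> Point3:
    """Translate a 3D point by the vector t."""
    return (p[0] + t[0], p[1] + t[1], p[2] + t[2])


def project(p: Point3, focal_px: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection using the focal length (in pixels) and image center."""
    x, y, z = p
    if z <= 0.0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal_px * x / z + cx, focal_px * y / z + cy)


# Hypothetical example: a model point 0.1 m to the side of the head center,
# with the head placed 1 m from the camera and a 500-pixel focal length.
surface_point = translate(rotate_about_y((0.1, 0.0, 0.0), 0.0), (0.0, 0.0, 1.0))
u, v = project(surface_point, focal_px=500.0, cx=160.0, cy=120.0)
```

Applying the same three steps to every sampled point of the half ellipsoid yields the set of image pixels covered by the face mask.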

The face hypotheses set represents different positions and scales of where the face could be in the current frame. The first hypothesis is created by projecting the pose $x_{t-1}$ from the previous frame into the image plane of the current frame. Further face hypotheses are created around this first hypothesis based on the trace of the previous pose covariance $\mathrm{tr}(\Lambda_{x_{t-1}})$. If $\mathrm{tr}(\Lambda_{x_{t-1}})$ is larger than a preset threshold, face hypotheses are created around the first hypothesis with increments of one pixel along both image plane axes and of 0.2 meters along the Z axis. Thresholds were set based on preliminary experiments, and the same values were used for all experiments. For each face hypothesis and each keyframe, an L2-norm distance is computed, and the $n_b$ best keyframes are then selected to be added to the base frame set. The previous frame $(I_{t-1}, x_{t-1})$ is always added to the base frame set.
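
The distance computation and ranking described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: images are small nested lists, the candidate images are assumed to be already scaled and translated according to each face hypothesis, and the names `masked_l2`, `best_hypothesis` and `select_base_frames` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Image = List[List[float]]
Mask = List[List[bool]]


def masked_l2(keyframe: Image, aligned: Image, mask: Mask) -> float:
    """L2-norm distance over the valid (face) pixels of the keyframe."""
    total = 0.0
    for k_row, a_row, m_row in zip(keyframe, aligned, mask):
        for k, a, valid in zip(k_row, a_row, m_row):
            if valid:
                total += (k - a) ** 2
    return math.sqrt(total)


def best_hypothesis(keyframe: Image, mask: Mask,
                    aligned_candidates: Sequence[Image]) -> Tuple[int, float]:
    """Index and distance of the face hypothesis closest to this keyframe."""
    distances = [masked_l2(keyframe, c, mask) for c in aligned_candidates]
    best = min(range(len(distances)), key=lambda h: distances[h])
    return best, distances[best]


def select_base_frames(keyframe_distances: Sequence[float], n_b: int) -> List[int]:
    """Indices of the n_b keyframes with the smallest distances."""
    order = sorted(range(len(keyframe_distances)),
                   key=lambda j: keyframe_distances[j])
    return order[:n_b]
```

In a full tracker the previous frame would be appended to the returned set unconditionally, since it is always part of the base frame set.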

3.2  Pose-Change  Measurements 

Pose-change measurements are relative pose differences between the current frame and one of the other views in our model $\mathcal{M}$. We presume that each pose-change measurement is probabilistically drawn from a Gaussian distribution $\mathcal{N}(y^t_s \mid x_t - x_s, \Lambda_{y^t_s})$. By definition pose increments have to be additive, thus pose-changes are assumed to be Gaussian. Formally, the set of pose-change measurements $\mathcal{Y}$ is defined as:

$$\mathcal{Y} = \{y^t_s, \Lambda_{y^t_s}\}$$

Different pose estimation paradigms will return different pose-change measurements:

•  The differential tracker computes the relative pose between the current frame and the previous frame, and returns the pose-change measurement $y^t_{t-1}$ with covariance $\Lambda_{y^t_{t-1}}$. Section 4.1 describes the view registration algorithm.

•  The keyframe tracker uses the same view registration algorithm described in Section 4.1 to compute the pose-change measurements $\{y^t_{k_j}, \Lambda_{y^t_{k_j}}\}$ between the current frame and the selected keyframes.

MAVAM integrates these two estimation paradigms. Section 4 describes how the pose-change measurements are computed for head pose estimation.

3.3  Model  Adaptation  and  Pose  Estimation 

To estimate the pose $x_t$ of the new frame based on the pose-change measurements, we use the Kalman filter formulation described in (Morency et al., 2003). The state vector $X$ is the concatenation of the view poses $\{x_t, x_{t-1}, x_{k_0}, x_{k_1}, x_{k_2}, \ldots\}$ as described in Section 3, and the observation vector $\mathcal{Y}$ is the concatenation of the pose-change measurements $\{y^t_{t-1}, y^t_{k_0}, y^t_{k_1}, y^t_{k_2}, \ldots\}$ as described in the previous section. The covariance between the components of $X$ is denoted by $\Lambda_X$.

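The Kalman filter update computes a prior for $p(X_t \mid y_{1..t-1})$ by propagating $p(X_{t-1} \mid y_{1..t-1})$ one step forward using a dynamic model. Each pose-change measurement $y^t_s \in \mathcal{Y}$ between the current frame and a base frame of $X$ is modeled as having come from:

$$y^t_s = C^t_s X + \omega,$$

$$C^t_s = [\, I \;\; 0 \;\cdots\; -I \;\cdots\; 0 \,],$$

where $\omega$ is Gaussian and $C^t_s$ is equal to $I$ at the view $t$, equal to $-I$ for the view $s$, and is zero everywhere else. Each pose-change measurement $(y^t_s, \Lambda_{y^t_s})$ is used to update all poses using the Kalman filter state update:

$$[\Lambda_{X_t}]^{-1} = [\Lambda_{X_{t-1}}]^{-1} + {C^t_s}^{\top} \Lambda_{y^t_s}^{-1} C^t_s \qquad (1)$$

$$X_t = \Lambda_{X_t} \left( [\Lambda_{X_{t-1}}]^{-1} X_{t-1} + {C^t_s}^{\top} \Lambda_{y^t_s}^{-1} y^t_s \right) \qquad (2)$$

After individually incorporating the pose-changes $(y^t_s, \Lambda_{y^t_s})$ using this update, $X_t$ is the mean of the posterior distribution $p(\mathcal{M} \mid \mathcal{Y})$.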
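
To make the update concrete, the following sketch applies one pose-change measurement to a toy state in which every pose is a single scalar, so that each identity block $I$ reduces to 1. It uses the standard Kalman gain form, which is algebraically equivalent to the information form of Equations 1 and 2 (by the matrix inversion lemma) and avoids inverting the full covariance. The function name and all numbers are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def fuse_pose_change(x: List[float], cov: Matrix, t: int, s: int,
                     y: float, var_y: float) -> Tuple[List[float], Matrix]:
    """Update all scalar poses with one measurement y ~ N(x[t] - x[s], var_y)."""
    n = len(x)
    c = [0.0] * n            # observation row C: +1 at view t, -1 at view s
    c[t], c[s] = 1.0, -1.0
    # cov_ct = Lambda_X C^T (a column vector)
    cov_ct = [sum(cov[i][j] * c[j] for j in range(n)) for i in range(n)]
    # Innovation variance: C Lambda_X C^T + Lambda_y
    innov_var = sum(c[i] * cov_ct[i] for i in range(n)) + var_y
    gain = [v / innov_var for v in cov_ct]
    residual = y - (x[t] - x[s])
    x_new = [x[i] + gain[i] * residual for i in range(n)]
    # Lambda_new = Lambda - K C Lambda; C Lambda equals cov_ct^T for symmetric Lambda
    cov_new = [[cov[i][j] - gain[i] * cov_ct[j] for j in range(n)]
               for i in range(n)]
    return x_new, cov_new


# Toy state [x_t, x_{t-1}, x_k]: the current pose starts at the previous pose
# with a large variance; the other two views are already well estimated.
x0 = [10.0, 10.0, 30.0]
cov0 = [[100.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 4.0]]
# Differential measurement: the head moved by +5 since the previous frame.
x1, cov1 = fuse_pose_change(x0, cov0, t=0, s=1, y=5.0, var_y=1.0)
```

In this toy run the current pose moves most of the way toward 15 and its variance shrinks sharply, while the previous-frame pose is nudged slightly as well. The same mechanism is what adapts keyframe poses when a keyframe serves as the base frame.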

3.4  Online Keyframe Acquisition and Management

An important advantage of MAVAM is the fact that keyframes are acquired online during tracking. MAVAM generalizes the previous AVAM (Morency et al., 2003) by (1) extending the tessellation space from 3D to 4D by including the depth of the object as the fourth dimension and (2) adding an extra step of keyframe management to ensure a constant tessellation of the pose space.

After estimating the current frame pose $x_t$, MAVAM must decide whether the frame should be inserted into the view-based model as a keyframe or not. The goal of the keyframes is to represent all different views of the head while keeping the number of keyframes low. In MAVAM, we use 4 dimensions to model the wide range of appearance. The first three dimensions are the three rotational axes (i.e., yaw, pitch and roll) and the last dimension is the depth of the head. This fourth dimension was added to the view-based model since the image resolution of the face changes when the user moves forward or backward, and maintaining keyframes at different depths improves the base frame set selection.

In our experiments, the pose space is tessellated in bins of equal size: 10 degrees for the rotational axes and 100 millimeters for the depth dimension. These bin sizes were set to the pose differences that our pose-change measurement algorithm (described in Section 4.1) can accurately estimate.

The current frame $(I_t, x_t)$ is added as a keyframe if either (1) no keyframe exists already around the pose $x_t$ and its variance is smaller than a threshold, or (2) the keyframe closest to the current frame pose has a larger variance than the current frame. The variance of $x_i$ is defined as the trace of its associated covariance matrix $\Lambda_{x_i}$.

The keyframe management step ensures that the original pose tessellation stays constant and that no more than one keyframe represents the same space bin. During the keyframe adaptation step described in Section 3.3, keyframe poses are updated and some keyframes may have shifted from their original poses. The keyframe management step goes through each tessellation bin of our view-based model and checks whether more than one keyframe pose lies in the region of that bin. If this is the case, then the keyframe with the lowest variance is kept while all the other keyframes are removed from the model. This process improves the performance of our MAVAM framework by compacting the view-based model.
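
The acquisition and management rules can be sketched as follows. This is a minimal illustration under stated assumptions: poses are reduced to (yaw, pitch, roll, depth), both "around the pose" and "closest keyframe" are interpreted as "in the same tessellation bin", and the names `pose_bin`, `should_add_keyframe` and `prune_keyframes` are hypothetical. The bin sizes are the ones reported for the experiments.

```python
import math
from typing import Dict, Iterable, Tuple

ROT_BIN_DEG = 10.0     # bin size for yaw, pitch and roll
DEPTH_BIN_MM = 100.0   # bin size for the depth dimension

Bin = Tuple[int, int, int, int]


def pose_bin(yaw_deg: float, pitch_deg: float, roll_deg: float,
             depth_mm: float) -> Bin:
    """Map a 4D pose to the index of its tessellation bin."""
    return (math.floor(yaw_deg / ROT_BIN_DEG),
            math.floor(pitch_deg / ROT_BIN_DEG),
            math.floor(roll_deg / ROT_BIN_DEG),
            math.floor(depth_mm / DEPTH_BIN_MM))


def should_add_keyframe(existing: Dict[Bin, float], frame_bin: Bin,
                        frame_variance: float, max_variance: float) -> bool:
    """Rule (1): empty bin and confident frame; rule (2): frame beats the keyframe."""
    current = existing.get(frame_bin)
    if current is None:
        return frame_variance < max_variance
    return current > frame_variance


def prune_keyframes(keyframes: Iterable[Tuple[Bin, float]]) -> Dict[Bin, float]:
    """Keep only the lowest-variance keyframe in each bin."""
    best: Dict[Bin, float] = {}
    for b, variance in keyframes:
        if b not in best or variance < best[b]:
            best[b] = variance
    return best
```

Here each keyframe is represented only by its bin and variance; a real model would carry the image and full pose alongside, and would recompute the bins after every adaptation step because keyframe poses can drift across bin boundaries.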

4  Monocular Head Pose Estimation

In this section we describe in detail how the pose-change measurements $y^t_s$ are computed for the different paradigms. For differential and keyframe tracking, $y^t_{t-1}$ and $y^t_{k_j}$ are computed using the Iterative Normal Flow Constraint described in the next section.

4.1  Monocular Iterative Normal Flow Constraint

Our goal is to estimate the 6-DOF transformation between a frame with known pose $(I_s, x_s)$ and a new frame with unknown pose $I_t$. Our approach is to use a simple 3D model of the head (half of an ellipsoid) and an iterative version of the Normal Flow Constraint (NFC) (Vedula et al., 1999). Since the pose is known for the base frame $(I_s, x_s)$, we can position the ellipsoid based on its pose $x_s$ and use it to solve the NFC linear system. Algorithm 3 shows the details of our iterative NFC.

5  Experiments 

The goal is to evaluate the accuracy and robustness of the MAVAM tracking framework on previously published datasets. The following section describes these datasets while Section 5.2 presents the details of the models compared in our experiments. Our results are shown in Sections 5.3 and 5.4. Our C++ implementation of MAVAM runs at 12Hz on one core of an Intel X535 Quad-core processor. The system was automatically initialized using the static pose estimator described in the previous section.

5.1  Datasets 

We evaluated the performance of our approach on two different datasets: the BU dataset from La Cascia et al. (La Cascia et al., 2000) and the MIT dataset from Morency et al. (Morency et al., 2003).

The BU dataset consists of 45 sequences (nine sequences for each of five subjects) taken under uniform illumination where the subjects perform free head motion including translations and both in-plane and out-of-plane rotations. All the sequences are 200 frames long (approximately seven seconds) and contain free head motion of several subjects. Ground truth for these sequences was simultaneously collected via a “Flock of Birds” 3D magnetic tracker (??, flock). The video signal was digitized at 30 frames per second at a resolution of 320x240. Since the focal length of the camera is unknown, we approximated it as 500 pixels using the size of the faces and the knowledge that the subjects should be sitting approximately one meter from the camera. This approximate focal length adds a challenge to this dataset.

The MIT dataset contains 4 video sequences with ground truth poses obtained from an InertiaCube2 sensor. The sequences were recorded at 6 Hz and the average length is 801 frames (~133 sec). During recording, subjects underwent rotations of about 125 degrees and translations of about 90cm, including translation along the Z axis. The sequences were originally recorded using a stereo camera from Videre Design (Design, 2000). For our experiments, we used only the left images. The exact focal length was known. By sensing gravity and the earth's magnetic field, the InertiaCube2 estimates for the X and Z axes (where Z points outside the camera and Y points up) are mostly driftless, but the Y axis can suffer from drift. InterSense reports an absolute pose accuracy of 3° RMS when the sensor is moving. This dataset is particularly challenging since the recorded frame rate was low, so the pose differences between frames are larger.

Algorithm 3 Iterative Normal Flow Constraint. Given the current frame $I_t$, a base frame $(I_s, x_s)$ and the internal camera calibration for both images, returns the pose-change measurement $y^t_s$ between both frames and its associated covariance $\Lambda_{y^t_s}$.
Compute initial transformation: Set the initial value for $y^t_s$ as the 2D translation between the face hypotheses for the current frame and the base frame (see Section 3.1).
Texture the ellipsoid model: Position the ellipsoid head model at $x_s + y^t_s$. Map the texture from $I_s$ onto the ellipsoid model by using the calibration information.
repeat
    Project ellipsoid model: Back-project the textured ellipsoid into the current frame using the calibration information.
    Normal Flow Constraint: Create a linear system by applying the normal flow constraint (Vedula et al., 1999) to each valid pixel in the current frame.
    Solve linear system: Estimate $\Delta y^t_s$ by solving the NFC linear system using linear least squares. Update the pose-change measurement $y^t_s \leftarrow y^t_s + \Delta y^t_s$ and estimate the covariance matrix $\Lambda_{y^t_s}$ (Lawson and Hanson, 1974).
    Warp ellipsoid model: Apply the transformation $\Delta y^t_s$ to the ellipsoid head model.
until maximum number of iterations reached or convergence: $\mathrm{trace}(\Lambda_{y^t_s}) < T_\Lambda$

Technique   Tx        Ty        Tz
MAVAM       1.00in    0.88in    1.82in

Technique   Pitch     Yaw       Roll
MAVAM       3.73°     5.44°     2.79°

Table 1: Average accuracies on BU dataset (La Cascia et al., 2000). MAVAM successfully tracked all 45 sequences while La Cascia et al. (La Cascia et al., 2000) reported an average percentage of tracked frames of only ~75%.

5.2  Models 

We compared two models for head pose estimation: our approach MAVAM as described in this paper, and the original stereo-based AVAM (Morency et al., 2003).

MAVAM  The Monocular Adaptive View-based Appearance Model (MAVAM) is the complete model as described in Section 3. This model integrates two pose estimation paradigms: differential tracking and keyframe tracking. It is applied on monocular intensity images.

3D AVAM  The stereo-based AVAM is the original model suggested by Morency et al. (Morency et al., 2003). The results for this model are taken directly from their research paper. Since this model uses intensity images as well as depth images, we should expect better accuracy for this 3D AVAM.

5.3  Results  with  BU  dataset 

The BU dataset presented in (La Cascia et al., 2000) contains 45 video sequences from 5 different people. The results published by La Cascia et al. are based on three error criteria: the average percentage of frames tracked, the position error and the orientation error. The position and orientation errors include only the tracked frames and ignore all frames with very large error. In our results, MAVAM successfully tracked all 45 video sequences without losing track at any point. Table 1 shows the accuracy of our MAVAM pose estimator. The average rotational accuracy is 3.9° while the average position error is 1.2 inches (3.1 cm). These results show that MAVAM is accurate and robust even when the focal length can only be approximated.

Technique   Pitch           Yaw            Roll
MAVAM       5.3° ± 15.3°    4.9° ± 9.6°    3.6° ± 6.3°
3D AVAM     2.4°            3.5°           2.6°

Table 2: Average rotational accuracies on MIT dataset (Morency et al., 2003). MAVAM performs almost as well as the 3D AVAM, which used stereo calibrated images, while our MAVAM works with monocular intensity images.

5.4  Results  with  MIT  dataset 

The MIT dataset presented in (Morency et al., 2003) contains four long video sequences (~2 min) with a large range of rotation and translation. Since the ground truth head positions were not available for this dataset, we present results for pose angle estimates only. Table 2 shows the average angular error for the different models. The results for 3D AVAM were taken from the original publication (Morency et al., 2003). We can see that MAVAM performs almost as well as the 3D AVAM, which used stereo calibrated images, while our MAVAM works with monocular intensity images.

6  Conclusion 

In this paper, we presented a probabilistic framework called Monocular Adaptive View-based Appearance Model (MAVAM) which integrates the advantages of two approaches: (1) the relative precision and user-independence of differential registration, and (2) the robustness and bounded drift of keyframe tracking. On two challenging 3-D head pose datasets, we demonstrated that MAVAM can reliably and accurately estimate head pose and position using a simple monocular camera.